Language Technology for Normalisation of Less-Resourced Languages
Authors
Abstract
This paper describes the stages involved in implementing a corpus of spoken Irish. This pilot project, consisting of approximately 140K words of transcribed data, implements part of the design of a larger corpus of spoken Irish, which it is hoped will contain approximately 2 million words when complete. It is hoped that such a corpus will provide material for linguistic research, lexicography, the teaching of Irish, and the development of language technology for the Irish language.
Similar resources
Quizzes on Tap: Exporting a Test Generation System from One Less-Resourced Language to Another
It is difficult to develop and deploy Language Technology and applications for minority languages for many reasons. These include the lack of Natural Language Processing (NLP) resources for the language, a scarcity of NLP researchers who speak the language and the communication gap between teachers in the classroom and researchers working in universities and other centres of research. One appro...
Morphological analysis for less-resourced languages: Maximum Affix Overlap applied to Zulu
The paper describes a collaboration approach in progress for morphological analysis of less-resourced languages. The approach is based on firstly, a language-independent machine learning algorithm, Maximum Affix Overlap, that generates candidates for morphological decompositions from an initial set of language-specific training data; and secondly, language-dependent post-processing using langua...
Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR
This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...
Introduction to the special issue on processing under-resourced languages
The creation of language and acoustic resources, for any given spoken language, is typically a costly task. For example, a large amount of time and money is required to properly create annotated speech corpora for automatic speech recognition (ASR), domain-specific text corpora for language modeling (LM), etc. The development of speech technologies (ASR, Text-to-Speech) for the already highreso...
Basic Language Resources for Diverse Asian Languages: A Streamlined Approach for Resource Creation
The REFLEX-LCTL (Research on English and Foreign Language Exploitation - Less Commonly Taught Languages) program, sponsored by the United States government, was an effort in simultaneous creation of basic language resources and technologies for under-resourced languages, with the aim to enrich sparse areas in language technology resources and encourage new research. We were tasked to produce basic...
Publication date: 2012